
Conversation

desertfire (Contributor) commented Oct 10, 2025

Stack from ghstack (oldest at bottom):

Summary: When quantizing a model with 4w_hqq (huggingface/optimum-executorch#164), AOTI-generated code will call aoti_torch_cuda__weight_int4pack_mm as a fallback op. This PR borrows the CUDA implementation of _weight_int4pack_mm_cuda from libtorch, replacing at::Tensor and the relevant utility functions with their ET equivalents.
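
At a high level, the new shim follows the existing AOTI fallback pattern: validate the arguments, then delegate to the kernel ported from libtorch. A rough sketch of that shape is below, for illustration only; the exact parameter list and the helper name are assumptions based on the snippets quoted in the review, not the merged code.

```
// Sketch only -- the parameter list and the helper name below are assumptions.
// Tensor, AOTITorchError, ET_CHECK_OR_RETURN_ERROR, and InvalidArgument are
// the ExecuTorch equivalents of the corresponding libtorch types/macros.

// Hypothetical wrapper around the kernel ported from libtorch's
// _weight_int4pack_mm_cuda, rewritten against ExecuTorch tensors.
Tensor* weight_int4pack_mm_cuda_impl(
    const Tensor& self,
    const Tensor& mat2,
    int64_t qGroupSize,
    const Tensor& qScaleAndZeros);

AOTITorchError aoti_torch_cuda__weight_int4pack_mm(
    Tensor* self,            // bf16 activations
    Tensor* mat2,            // int4 weights packed into int32 values
    int64_t qGroupSize,      // quantization group size (32/64/128/256)
    Tensor* qScaleAndZeros,  // per-group scales and zero points
    Tensor** ret0) {         // output handed back to AOTI-generated code
  ET_CHECK_OR_RETURN_ERROR(
      ret0 != nullptr,
      InvalidArgument,
      "aoti_torch_cuda__weight_int4pack_mm failed: ret0 is null");
  ET_CHECK_OR_RETURN_ERROR(
      qGroupSize == 32 || qGroupSize == 64 || qGroupSize == 128 ||
          qGroupSize == 256,
      InvalidArgument,
      "aoti_torch_cuda__weight_int4pack_mm: qGroupSize must be 32/64/128/256, got %lld",
      static_cast<long long>(qGroupSize));
  // Delegate to the ported kernel and return the result to the caller.
  *ret0 = weight_int4pack_mm_cuda_impl(*self, *mat2, qGroupSize, *qScaleAndZeros);
  return Error::Ok;  // assumes AOTITorchError aliases executorch::runtime::Error
}
```

The heavy lifting stays in the ported kernel; the shim only adapts the AOTI calling convention to ExecuTorch tensors.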

Using the Voxtral runner as an example (tested on an H100):

With the bfloat16 format, here are the generated .ptd file size and latencies.

```
optimum-cli export executorch \
    --model "mistralai/Voxtral-Mini-3B-2507" \
    --task "multimodal-text-to-text" \
    --recipe "cuda" \
    --dtype bfloat16 \
    --device cuda \
    --max_seq_len 1024 \
    --output_dir ./
```

```
aoti_cuda_blob.ptd: 9.0 GB

Program load latency (ms): 0.054
Method load latency (ms):
  audio_encoder: 1492.989
  token_embedding: 803.561
  text_decoder: 6556.770
Run latency (ms):
  audio_encoder: 76.848
  token_embedding: 6.479
  text_decoder: 149.128
```

With `--qlinear 4w_hqq --qlinear_encoder 4w_hqq`, the .ptd file size is cut by more than half, but the encoder and decoder run latencies regress.

```
aoti_cuda_blob.ptd: 3.7 GB

Program load latency (ms): 0.051
Method load latency (ms):
  audio_encoder: 716.667
  token_embedding: 633.476
  text_decoder: 1840.760
Run latency (ms):
  audio_encoder: 329.274
  token_embedding: 4.285
  text_decoder: 335.590
```

Here is the result with `--qlinear 4w --qlinear_encoder 4w`, where the weights are quantized but the linear is computed as dequant + fp16 matmul. Compared with 4w_hqq, the generated file is a bit larger, but the computation is surprisingly faster; this needs more investigation.

```
aoti_cuda_blob.ptd: 5.4 GB

Program load latency (ms): 0.064
Method load latency (ms):
  audio_encoder: 872.016
  token_embedding: 663.107
  text_decoder: 3104.973
Run latency (ms):
  audio_encoder: 75.777
  token_embedding: 4.067
  text_decoder: 149.420
```

Differential Revision: D84395275

desertfire added a commit that referenced this pull request Oct 10, 2025
ghstack-source-id: a543a05
Pull Request resolved: #15030
meta-cla bot added the CLA Signed label Oct 10, 2025

pytorch-bot bot commented Oct 10, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/15030

Note: Links to docs will display an error until the docs builds have been completed.

❗ 2 Active SEVs

There are 2 currently active SEVs. If your PR is affected, please view them below:

❌ 3 New Failures, 4 Unrelated Failures

As of commit 7dbdad2 with merge base afd98fe:

NEW FAILURES - The following jobs have failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed but were already failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.


This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@desertfire has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

```
cuda_shim_cpp_unittest("aoti_torch__reinterpret_tensor")
cuda_shim_cpp_unittest("aoti_torch_copy_")
cuda_shim_cpp_unittest("aoti_torch_cuda_guard")
cuda_shim_cpp_unittest("aoti_torch_cuda__weight_int4pack_mm")
```

desertfire (Contributor Author):

@larryliu0820, I didn't find a CMakeLists.txt for all these unit tests. I suppose we can only test them in fbcode?

mergennachin (Contributor) left a comment:

See inline for additional documentation (I used Claude Code to generate the docs).

This is great, thank you!

```
    ret0 != nullptr,
    InvalidArgument,
    "aoti_torch_cuda__weight_int4pack_mm failed: ret0 is null");
```

Contributor:

```
ET_CHECK_OR_RETURN_ERROR(
    qGroupSize == 32 || qGroupSize == 64 || qGroupSize == 128 || qGroupSize == 256,
    InvalidArgument,
    "aoti_torch_cuda__weight_int4pack_mm: qGroupSize must be 32/64/128/256, got %lld",
    static_cast<long long>(qGroupSize));
```

```
#endif

AOTITorchError aoti_torch_cuda__weight_int4pack_mm(
    Tensor* self,
```

Contributor:

Should we check whether `self` is bfloat16?

desertfire (Contributor Author):

We already have quite a few tensor checks in the actual _weight_int4pack_mm_cuda function, so we don't have to repeat them here?


```
AOTITorchError aoti_torch_cuda__weight_int4pack_mm(
    Tensor* self,
    Tensor* mat2,
```

Contributor:

Check whether `mat2` is int32?
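
For illustration, guards of the kind suggested in these comments could look roughly like this (the accessor and enum spellings here are assumptions, and the ported _weight_int4pack_mm_cuda already performs similar validation internally):

```
// Hypothetical dtype guards at the shim boundary (names assumed):
ET_CHECK_OR_RETURN_ERROR(
    self->scalar_type() == executorch::aten::ScalarType::BFloat16,
    InvalidArgument,
    "aoti_torch_cuda__weight_int4pack_mm: self must be bfloat16");
ET_CHECK_OR_RETURN_ERROR(
    mat2->scalar_type() == executorch::aten::ScalarType::Int,
    InvalidArgument,
    "aoti_torch_cuda__weight_int4pack_mm: mat2 must be int32 (int4-packed)");
```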

mergennachin (Contributor) commented Oct 12, 2025

Wait, why is the "Run latency" slower with int4? cc @swolchok

desertfire added a commit that referenced this pull request Oct 13, 2025
ghstack-source-id: a0c94a0
Pull Request resolved: #15030
desertfire added a commit that referenced this pull request Oct 13, 2025
ghstack-source-id: 29b5b16
Pull Request resolved: #15030

@desertfire has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

desertfire (Contributor Author):

> Wait, why is the "Run latency" slower with int4? cc @swolchok

@jerryzh168, I did a quick nsys profile and found that tinygemm_m16n8k16_chunk_kernel is now the top kernel (87.4% of execution time across all CUDA kernels) and seems pretty slow (0.552 s). Is this something you have seen before?

jerryzh168 (Contributor):
> Wait, why is the "Run latency" slower with int4? cc @swolchok
>
> @jerryzh168, I did a quick nsys profile and found that tinygemm_m16n8k16_chunk_kernel is now the top kernel (87.4% of execution time across all CUDA kernels) and seems pretty slow (0.552 s). Is this something you have seen before?

This kernel is only optimized for batch size 1; is this what you are testing?

jerryzh168 (Contributor):
It depends on the hardware, I think. If it has to be A100, then it seems the only other option is the gemlite kernels, which are written in Triton.

If H100, we can integrate the fbgemm kernels; the effort would be similar to this PR, I think.

meta-codesync bot merged commit 63b8a91 into gh/desertfire/1/base Oct 14, 2025
130 of 139 checks passed
meta-codesync bot deleted the gh/desertfire/1/head branch October 14, 2025 02:14
desertfire added a commit that referenced this pull request Oct 14, 2025
This PR was created by the merge bot to help merge the original PR into
the main branch.
ghstack PR number: #15030 by @desertfire
^ Please use this as the source of truth for the PR details, comments, and reviews
ghstack PR base: https://github.com/pytorch/executorch/tree/gh/desertfire/1/base
ghstack PR head: https://github.com/pytorch/executorch/tree/gh/desertfire/1/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/main
Merge bot PR head: https://github.com/pytorch/executorch/tree/gh/desertfire/1/orig
Differential Revision: [D84395275](https://our.internmc.facebook.com/intern/diff/D84395275)
@diff-train-skip-merge

Co-authored-by: Bin Bao <[email protected]>

Labels: CLA Signed, topic: not user facing